Search CORE

94 research outputs found

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

Author: Buttari Alfredo
Dongarra Jack
Kurzak Jakub
Langou Julien
Publication venue
Publication date: 01/01/2007
Field of study

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms where parallelism can only be exploited at the level of the BLAS operations and vendor implementations

arXiv.org e-Print Archive

CiteSeerX

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

MIMS EPrints

The University of Manchester - Institutional Repository

Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs

Author: Anzt Hartwig
Dongarra Jack
Gates Mark
Kurzak Jakub
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2016
Field of study

The University of Manchester - Institutional Repository

Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs

Author: Dongarra Jack
Kurzak Jakub
Luszczek Piotr
Mary Théo
Tomov Stanimire
Yamazaki Ichitaro
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 15/11/2015
Field of study

International audienceA low-rank approximation of a dense matrix plays an important role in many applications. To compute such an approximation , a common approach uses the QR factorization with column pivoting (QRCP). Though the reliability and efficiency of QRCP have been demonstrated, this determin-istic approach requires costly communication at each step of the factorization. Since such communication is becoming increasingly expensive on modern computers, an alternative approach based on random sampling, which can be implemented using communication-optimal kernels, is becoming attractive. To study its potential, in this paper, we compare the performance of random sampling with that of QRCP on an NVIDIA Kepler GPU. Our performance results demonstrate that random sampling can be up to 12.8× faster than the deterministic approach for computing the approximation of the same accuracy. We also present the parallel scaling of the random sampling over multiple GPUs on a single compute node, showing a speedup of 3.8× over three Kepler GPUs. These results demonstrate the potential of the random sampling as an excellent computational tool for many applications, and its potential is likely to grow on the emerging computers with the increasing communication costs

Crossref

Divide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures

Author: Faverge Mathieu
Haidar Azzam
Kurzak Jakub
Pichon Grégoire
Publication venue: HAL CCSD
Publication date: 25/05/2015
Field of study

International audienceComputing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-elements computation for automo-biles. A classical approach is to reduce the matrix to tridiagonal form before computing eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task-flow, scheduled in an out-of-order fashion by a dynamic runtime which allows the programmer to play with tasks granularity. The resulting implementation is between two and five times faster than the equivalent routine from the INTEL MKL library, and outperforms the best MRRR implementation for many matrices

INRIA a CCSD electronic archive server

Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs

Author: Hartwig Anzt
Jack Dongarra
Jakub Kurzak
Mark Gates
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems

Author: Abalenkovs Maksims
Abdelfattah Ahmad
Dongarra Jack
Gates M.
Haidar A
Kurzak Jakub
Luszczek Piotr
Tomov Stanimire
Yamazaki I.
YarKhan A.
Publication venue: 'FSAEIHE South Ural State University (National Research University)'
Publication date: 01/01/2015
Field of study

The University of Manchester - Institutional Repository

On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties

Author: Donfack Simplice
Dongarra Jack
Faverge Mathieu
Gates Mark
Kurzak Jakub
Luszczek Piotr
Yamazaki Ichitaro
Publication venue: HAL CCSD
Publication date: 01/01/2013
Field of study

Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of the Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy is analyzed

INRIA a CCSD electronic archive server